Parenthetical Classification for Information Extraction
Authors
Abstract
This article addresses a largely unexplored topic in NLP: parenthetical classification. Parentheticals are defined here as any text sequence between parentheses. They have so far been approached from isolated perspectives, such as the extraction of translation pairs, but a full account of their syntactic and semantic properties is lacking. The article proposes a new, comprehensive classification scheme drawn from corpus-based linguistic studies of French news text. This research is part of a project investigating the structural aspects of punctuation signs and their usefulness for Information Extraction. Parenthetical classification is approached as a relation extraction problem split into three correlated subtasks: syntactic classification, semantic classification, and head recognition. The corpus-based studies singled out 11 syntactic and 18 semantic relation subtypes. The article then addresses automatic classification using a combination of CRF and SVM. This baseline system achieves F1 scores of 0.674 (head recognition), 0.908 (syntax), 0.734 (semantics), and 0.518 (end-to-end).
Title in French (FR): Classification des parenthétiques pour l’extraction d’information
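The abstract's working definition, a parenthetical as any text sequence between parentheses, can be illustrated with a minimal sketch. This is not the article's system (which combines CRF and SVM); it only shows the candidate-detection step that would precede any classification. The function name `find_parentheticals`, the regular expression, and the example sentence are assumptions made for illustration, and nested parentheses are deliberately not handled.

```python
import re

# Assumption: a parenthetical is a span "( ... )" with no nested parentheses.
# Nested cases would only match their innermost pair.
PAREN_RE = re.compile(r"\(([^()]*)\)")


def find_parentheticals(sentence):
    """Return (start, end, inner_text) for each parenthetical in the sentence.

    start and end are character offsets of the full span, parentheses included.
    """
    return [(m.start(), m.end(), m.group(1)) for m in PAREN_RE.finditer(sentence)]


# Hypothetical example sentence, not taken from the article's corpus.
example = "The European Central Bank (ECB) met in Frankfurt (Germany)."

for start, end, inner in find_parentheticals(example):
    print(start, end, inner)
```

Each detected span would then be passed to the three subtasks named in the abstract: finding the head it attaches to, and assigning its syntactic and semantic relation class.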
Similar resources
A Lexicon of French Quotation Verbs for Automatic Quotation Extraction
Quotation extraction is an important information extraction task, especially when dealing with news wires. Quotations can be found in various configurations. In this paper, we focus on direct quotations introduced by a parenthetical clause, headed by a “quotation verb”. Our study is based on a large French news wire corpus from the Agence France-Presse. We introduce and motivate an analysis at ...
A Discriminative Approach to Japanese Abbreviation Extraction
This paper addresses the difficulties in recognizing Japanese abbreviations through the use of previous approaches, examining actual usages of parenthetical expressions in newspaper articles. In order to bridge the gap between Japanese abbreviations and their full forms, we present a discriminative approach to abbreviation recognition. More specifically, we formalize the abbreviation recognitio...
Developing a New Method in Object Based Classification to Updating Large Scale Maps with Emphasis on Building Feature
As cities expand, updating urban maps for urban planning is important, and its effectiveness depends on the accuracy of information extraction / change detection. Information extraction methods are divided into two groups: Pixel-Based (PB) and Object-Based (OB). OB analysis has overcome the limitations of PB analysis (producing salt-pepper results and features with hole...
Development of an Automatic Land Use Extraction System in Urban Areas using VHR Aerial Imagery and GIS Vector Data
The lack of detailed land use (LU) information and of efficient data collection methods has made the modeling of urban systems difficult. This study aims to develop a novel hierarchical rule-based LU extraction framework using geographic vector and remotely sensed (RS) data, in order to extract detailed subzonal LU information, residential LU in this study. The LU extraction system is developed to ex...
Feature selection using genetic algorithm for classification of schizophrenia using fMRI data
In this paper we propose a new method for classification of subjects into schizophrenia and control groups using functional magnetic resonance imaging (fMRI) data. In the preprocessing step, the number of fMRI time points is reduced using principal component analysis (PCA). Then, independent component analysis (ICA) is used for further data analysis. It estimates independent components (ICs) of...